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Abstract 

This paper describes an experimental compari- 
son of seven different learning algorithms on the 
problem of learning to disambiguate the meaning 
of a word from context. The algorithms tested 
include statistical, neural-network, decision-tree, 
rule-based, and case-based classification tech- 
niques. The specific problem tested involves dis- 
ambiguating six senses of the word "line" using 
the words in the current and proceeding sentence 
as context. The statistical and neural-network 
methods perform the best on this particular prob- 
lem and we discuss a potential reason for this ob- 
served difference. We also discuss the role of hms 
in machine learning and its importance in explain- 
ing performance differences observed on specific 
problems. 

Introduction 

Recent research in empirical (corpus-based) natu- 
ral language processing has explored a number of 
different methods for learning from data. Three 
general approaches are statistical, neural-network, 
and symbolic machine learning and numerous spe- 
cific methods have been developed under each 
of these paradigms (Wermter, Riloff, & Scheler, 
1996; Charniak, 1993; Reilly & Sharkey, 1992). 
An important question is whether some methods 
perform significantly better than others on partic- 
ular types of problems. Unfortunately, there have 
been very few direct comparisons of alternative 
methods on identical test data. 

A somewhat indirect comparison of apply- 
ing stochastic context-free grammars (Periera & 
Shabes, 1992), a transformation-based method 
(Brill, 1993), and inductive logic program- 
ming (Zelle & Mooney, 1994) to parsing the 
ATIS (Airline Travel Information Service) cor- 
pus from the Penn Treebank (Marcus, Santorini, 
& Marcinkiewicz, 1993) indicates fairly similar 
performance for these three very different meth- 
ods. Also, comparisons of Bayesian, information- 



retrieval, neural-network, and case-based methods 
on word-sense disambiguation have also demon- 
strated similar performance (Leacock, Towell, & 
Voorhees, 1993b; Lehman, 1994). However, in 
a comparison of neural-network and decision-tree 
methods on learning to generate the past tense 
of an English verb, decision trees performed sig- 
nificantly better (Ling & Marinov, 1993; Ling, 
1994). Subsequent experiments on this problem 
have demonstrated that an inductive logic pro- 
gramming method produces even better results 
than decision trees (Mooney & Califf, 1995). 

In this paper, we present direct comparisons 
of a fairly wide range of general learning algo- 
rithms on the problem of discriminating six senses 
of the word "line" from context, using data as- 
sembled by Leacock et al. (1993b). We compare 
a naive Bayesian classifier (Duda & Hart, 1973), 
a perceptron (Rosenblatt, 1962), a decision-tree 
learner (Quinlan, 1993), a k nearest-neighbor clas- 
sifier (Cover & Hart, 1967), logic-based DNF (dis- 
junctive normal form) and CNF (conjunctive nor- 
mal form) learners (Mooney, 1995) and a decision- 
list learner (Rivest, 1987). Tests on all methods 
used identical training and test sets, and ten sep- 
arate random trials were run in order to measure 
average performance and allow statistical testing 
of the significance of any observed differences. On 
this particular task, we found that the Bayesian 
and perceptron methods perform significantly bet- 
ter than the remaining methods and discuss a po- 
tential reason for this observed difference. We also 
discuss the role of hms in machine learning and its 
importance in explaining the observed differences 
in the performance of alternative methods on spe- 
cific problems. 

Background on Machine Learning 
and Bias 

Research in machine learning over the last ten 
years has been particularly concerned with exper- 
imental comparisons and the relative performance 
of different classification methods (Shavlik & Di- 



etterich, 1990; Kulikowski & Weiss, 1991; Langley, 
1996). In particular, the UCI Machine Learning 
Data Repository (Merz, Murphy, & Aha, 1996) 
was assembled to facilitate empirical comparisons. 
Experimental comparisons of different methods on 
various benchmark problems have generally found 
relatively small differences in predictive accuracy 
(Mooney, Shavlik, Towell, & Gove, 1989; Fisher & 
McKusick, 1989; Weiss & Kapouleas, 1989; Atlas, 
Cole, Conner, El-Sharkawi, Marks, Muthusamy, 
& Bernard, 1990; Dietterich, Hild, & Bakiri, 1990; 
Kulikowski & Weiss, 1991; Shavlik, Mooney, & 
Towell, 1991; Holte, 1993). However, on specific 
problems, certain methods can demonstrate a sig- 
nificant advantage. For example, on the problem 
of detecting promoter sequences in DNA (which 
indicate the start of a new gene), neural-network 
and similar methods perform significantly better 
than symbolic induction methods (Towell, Shav- 
lik, & Noordewier, 1990; Baffes & Mooney, 1993). 
On the other hand, as mentioned in the introduc- 
tion, symbolic induction methods perform signifi- 
cantly better than neural-networks on the problem 
of learning to generate the past tense of an English 
verb (Ling & Marinov, 1993; Ling, 1994; Mooney 
& Califf, 1995). 

It is generally agreed that the philosophical 
problem of induction (Hume, 1748) means that 
no inductive algorithm is universally better than 
any other. It can be proven that when averaged 
over a uniform distribution of all possible classi- 
fication problems, the generalization performance 
(predictive accuracy on unseen examples) of any 
inductive algorithm is zero. This has been called 
the "Conservation Law for Generalization Perfor- 
mance" (Schaffer, 1994) or a "no free lunch" the- 
orem (Wolpert, 1992). However, averaging over 
a uniform distribution of all possible functions is 
effectively equivalent to assuming a "random uni- 
verse" in which the past is not predictive of the 
future. If all problems are not equally likely, the 
expected generalization performance over a distri- 
bution of real- world problems can of course be pos- 
itive (Rao, Gordon, & Spears, 1995). 

In machine learning, bias refers to "any ba- 
sis for choosing one generalization over another, 
other than strict consistency with the instances" 
(Mitchell, 1980). Decision-tree methods have 
a bias for simple decision trees, rule induction 
methods have a bias for simple DNF expressions, 
neural-network methods have a bias for linear 
threshold functions, ^ and naive Bayes has a bias 
for functions which respect conditional indepen- 
dence of features. The more the bias of a certain 



^Although multi-layer networks with sufficient hid- 
den can represent arbitrary nonlinear functions, they 
will tend to learn a linear function if one exists that is 
consistent with the training data. 



learning algorithm fits the characteristics of a par- 
ticular problem, the better it will perform on that 
problem. Most learning algorithms have some sort 
of "Occam's razor" bias in which hypotheses that 
can be represented with fewer bits in some particu- 
lar representation language are preferred (Blumer, 
Ehrenfeucht, Haussler, & Warmuth, 1987). How- 
ever, the compactness with which different repre- 
sentation languages (e.g. decision trees, DNF, lin- 
ear threshold networks) can represent particular 
functions can vary dramatically (e.g. see Pagallo 
and Haussler (1990)). Therefore, different biases 
can perform better or worse on specific problems. 
One of the main goals of machine learning is to 
find biases that perform well on the distribution 
of problems actually found in the real world. 

As an example, consider the advantage neural- 
networks have on the promoter recognition prob- 
lem mentioned earlier. There are several potential 
sites where hydrogen bonds can form between the 
DNA and a protein and if enough of these bonds 
form, promoter activity can occur. This is rep- 
resented most compactly using an M-of-N classi- 
fication function which returns true if any subset 
of size M of N specified features are present in 
an example (Fisher & McKusick, 1989; Murphy 
& Pazzani, 1991; Baffes & Mooney, 1993). A sin- 
gle linear threshold unit can easily represent such 
functions, whereas a DNF expression requires "N 
choose M" terms to represent them. Therefore, 
the difference in their ability to compactly rep- 
resent such functions explains the observed per- 
formance difference between rule induction and 
neural-networks on this problem. ^ 

Of course picking the right bias or learning al- 
gorithm for a particular task is a difficult problem. 
A simple approach is to automate the selection of 
a method using internal cross-validation (Schaffer, 
1993). Another approach is to use meta-learning 
to learn a set of rules (or other classifier) that pre- 
dicts when a learning algorithm will perform best 
on a domain given features describing the problem 
(Aha, 1992). A recent special issue of the Machine 
Learning journal on "Bias Evaluation and Selec- 
tion" introduced by Gordon and desJardins (1995) 
presents current research in this general area. 

Learning to Disambiguate Word 
Senses 

Several recent research projects have taken a 
corpus-based approach to lexical disambiguation 
(Brown, Della-Pietra, Della-Pietra, & Mercer, 
1991; Gale, Church, & Yarowsky, 1992b; Leacock 
et al., 1993b; Lehman, 1994). The goal is to learn 



^This explanation was originally presented by 
Shavlik et al. (1991). 



to use surrounding context to determine the sense 
of an ambiguous word. Our tests are based on the 
corpus assembled by Leacock et al. (1993b). The 
task is to disambiguate the word "line" into one 
of six possible senses (text, formation, division, 
phone, cord, product) based on the words occur- 
ring in the current and previous sentence. The cor- 
pus was assembled from the 1987-89 Wall Street 
Journal and a 25 million word corpus from the 
American Printing House for the Blind. Sentences 
containing "line" were extracted and assigned a 
single sense from WordNet (Miller, 1991). There 
are a total of 4,149 examples in the full corpus un- 
equally distributed across the six senses. Due to 
the use of the Wall Street Journal, the "product" 
sense is more than 5 times as common as any of 
the others. Previous studies have first sampled the 
data so that all senses were equally represented. 

Leacock et al. (1993b), Leacock, Towell, 
and Voorhees (1993a) and Voorhees, Leacock, 
and Towell (1995) present results on a Bayesian 
method (Gale, Church, & Yarowsky, 1992a), a 
content vector method from information retrieval 
(Salton, Wong, & Yang, 1975), and a neural net- 
work trained using backpropagation (Rumelhart, 
Hinton, & Williams, 1986). The neural network 
architecture that performed at least as well as any 
other contained no hidden units, so was effectively 
equivalent to a perceptron. On the six-sense task 
trained on 1,200 examples and averaged over three 
random trials, they report the following general- 
ization accuracies: Bayesian, 71%; content vec- 
tors, 72%; neural nets, 76%. None of these differ- 
ences were statistically significant given the small 
number of trials. 

In these studies, the data for the content- 
vector and neural-network methods was first re- 
duced by ignoring case and reducing words to 
stems (e.g. computer(s), computing, computa- 
tton(al), etc. are all confiated to the feature 
comput) and removing a set of about 570 high- 
frequency stopwords (e.g. the, by, you, etc.). Sim- 
ilar preprocessing was performed for the current 
experiments, but we can not guarantee identical 
results. The result was a set of 2,094 examples 
equally distributed across the six senses where 
each example was described using 2,859 binary 
features each representing the presence or absence 
of a particular word stem in the current or imme- 
diately preceding sentence. 

Learning Algorithms Tested 

The current experiments test a total of seven 
different learning algorithms with quite dif- 
ferent biases. This section briefiy describes 
each of these algorithms. Except for C4.5, 
which uses the C code provided by Quin- 



lan (1993), all of these methods are imple- 
mented in Common Lisp and available on-line at 
http : //¥¥¥. cs .utexas . edu/users/ml/ml-progs .html 
All systems were run on a Sun SPARCstation 5 
with 40MB of main memory. 

The simplest algorithms tested were a naive 
Bayesian classifier which assumes conditional in- 
dependence of features and a k nearest-neighbor 
classifier, which assigns a test example to the 
majority class of the 3 closest training examples 
(using Hamming distance to measure closeness) 
(Duda & Hart, 1973; Kulikowski & Weiss, 1991). 
Initial results indicated that k nearest neighbor 
with k=3 resulted in slightly better performance 
than k=l. Naive Bayes is intended as a simple 
representative of statistical methods and nearest 
neighbor as a simple representative of instance- 
based (case-based, exemplar) methods (Cover & 
Hart, 1967; Aha, Kibler, & Albert, 1991). 

Since the previous results of Leacock et al. 
(1993b) indicated that neural networks did not 
benefit from hidden units on the "line" disam- 
biguation data, we employed a simple perceptron 
(Rosenblatt, 1962) as a representative connection- 
ist method. The implementation learns a separate 
perceptron for recognizing each sense and assigns 
a test case to the sense indicated by the perceptron 
whose output most exceeds its threshold. In the 
current experiments, there was never a problem 
with convergence during training. 

As a representative of decision-tree methods, 
we chose C4.5 (Quinlan, 1993), a system that is 
easily available and included in most recent exper- 
imental comparisons in machine learning. All pa- 
rameters were left at their default values. We also 
tested C4.5-RULES, a variant of C4.5 in which de- 
cision trees are translated into rules and pruned; 
however, its performance was slightly inferior to 
the base C4.5 system on the "line" corpus; there- 
fore, its results are not included. 

Finally, we tested three simple logic-based in- 
duction algorithms that employ different represen- 
tations of concepts: DNF, CNF, and decision lists. 
Most rule-based methods, e.g. Michalski (1983), 
induce a disjunctive set of conjunctive rules and 
therefore represent concepts in DNF. Some recent 
results have indicated that representing concepts 
in CNF (a conjunction of disjunctions) frequently 
performs somewhat better (Mooney, 1995). Some 
concepts are more compactly represented in CNF 
compared to DNF and vice versa. Therefore, 
both representations are included. Finally, deci- 
sion lists (Rivest, 1987) are ordered lists of con- 
junctive rules, where rules are tested in order and 
the first one that matches an instance is used to 
classify it. A number of effective concept-learning 
systems have employed decision lists (Clark & 



Niblett, 1989; Quinlan, 1993; Mooney & Califf, 
1995) and they have already been successfully ap- 
plied to lexical disambiguation (Yarowsky, 1994). 

All of the logic-based methods are variations 
of the Foil algorithm for induction of first-order 
function-free Horn clauses (Quinlan, 1990), ap- 
propriately simplified for the propositional case. 
They are called PFoiL-DNF, PFoiL-CNF, and 
PFoiL-DLlST. The algorithms are greedy cov- 
ering (separate-and-conquer) methods that use 
an information-theoretic heuristic to guide a top- 
down search for a simple definition consistent with 
the training data. PFoiL-DNF (PFoiL-CNF) 
learns a separate DNF (CNF) description for each 
sense using the examples of that sense as posi- 
tive instances and the examples of all other senses 
as negative instances. Mooney (1995) describes 
PFoiL-DNF and PFoiL-CNF in more detail and 
PFoiL-DLlST is based on the first-order decision- 
list learner described by Mooney and Califf (1995). 

Experiments 

In order to evaluate the performance of these seven 
algorithms, direct multi-trial comparisons on iden- 
tical training and test sets were run on the "line" 
corpus. Such head-to-head comparisons of meth- 
ods are unfortunately relatively rare in the empiri- 
cal natural-language literature, where papers gen- 
erally report results of a single method on a single 
training set with, at best, indirect comparisons to 
other methods. 

Experimental Methodology 

Learning curves were generated by splitting the 
preprocessed "line" corpus into 1,200 training ex- 
amples and 894 test cases, training all methods 
on an increasingly larger subset of the training 
data and repeatedly testing them on the test 
set. Learning curves are fairly common in ma- 
chine learning but not in corpus-based language 
research. We believe they are important since 
they reveal how algorithms perform with varying 
amounts of training data and how their perfor- 
mance improves with additional training. Results 
on a fixed-sized training set gives only one data 
point on the learning curve and leaves the possi- 
bility that differences between algorithms are hid- 
den due to a ceiling effect, in which there are 
sufficient training examples for all methods to 
reach near Bayes-optimal performance.^ Learning 

^Bayes-optimal performance is achieved by always 
picking the category with the maximum probability 
given all of its features. This requires actually knowing 
the conditional probability of each category given each 
of the exponentially large number of possible instance 
descriptions. 



curves generally follow a power law where predic- 
tive accuracy climbs fairly rapidly and then lev- 
els off at an asymptotic level. A learning curve 
can reveal whether the performance of a system is 
approaching an asymptote or whether additional 
training data would likely result in significant im- 
provement. Since gathering annotated training 
data is an expensive time-consuming process, it is 
important to understand the performance of meth- 
ods given varying amounts of training data. 

In addition to measuring generalization accu- 
racy, we also collected data on the CPU time taken 
to train and test each method for each training- 
set size measured on the learning curve. This pro- 
vides information on the computational resources 
required by each method, which may also be useful 
in deciding between them for particular applica- 
tions. It also provides data on how the algorithm 
scales by providing information on how training 
time grows with training-set size. 

Finally, all results are averaged over ten ran- 
dom selections of training and test sets. The per- 
formance of a system can vary a fair bit from trial 
to trial, and a difference in accuracy on a sin- 
gle training set may not indicate an overall per- 
formance advantage. Unfortunately, most results 
reported in empirical natural-language research 
present only a single or very small number of tri- 
als. Running multiple trials also allows for sta- 
tistical testing of the significance of any resulting 
differences in average performance. We employ 
a simple two-tailed, paired t-test to compare the 
performance of two systems for a given training- 
set size, requiring significance at the 0.05 level. 
Even more sophisticated statistical analysis of the 
results is perhaps warranted. 

Experimental Results 

The resulting learning curves are shown in Fig- 
ure 1 and results on training and testing time are 
shown in Figures 2 and 3. Figure 3 presents the 
time required to classify the complete set of 894 
test examples. 

With respect to accuracy, naive Bayes and 
perceptron perform significantly better (p < 0.05) 
than all other methods for all training-set sizes. 
Naive Bayes and perceptron are not significantly 
different, except at 1,200 training examples where 
naive Bayes has a slight advantage. Note that the 
results for 1,200 training examples are compara- 
ble to those obtained by Leacock et al. (1993b) for 
similar methods. PFoiL-DLlST is always signifi- 
cantly better than PFoiL-DNF and PFoiL-CNF 
and significantly better than 3 Nearest Neighbor 
and C4.5 at 600 and 1,200 training examples. C4.5 
and 3 Nearest Neighbor are always significantly 
better than PFoiL-DNF and PFoiL-CNF but 
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Figure 1: Accuracy at Disambiguating "Line' 



not significantly different from each other. Finally, 
PFoiL-DNF is significantly better than PFoiL- 
CNF at 600 and 1,200 training examples. 

With respect to training time, virtually all dif- 
ferences are significant. The logic-based induction 
methods are slowest, C4.5 and perceptron inter- 
mediate, and naive Bayes the fastest. Since it just 
stores examples, training time for Nearest Neigh- 
bor is always zero. In general, connectionist meth- 
ods are much slower to train than alternative tech- 
niques (Shavlik et al., 1991); however, in this case 
a simple perceptron converges quite rapidly. 

With respect to testing time, the symbolic in- 
duction methods are fastest and almost indistin- 
guishable from zero in Figure 3 since they only 
need to test a small subset of the features. ■* 
All visible differences in the graph are significant. 
Naive Bayes is the slowest; both it and percep- 
tron have the constant overhead of computing a 
weighted function over all of the almost 3,000 fea- 
tures. Nearest neighbor grows linearly with the 
number of training instances as expected; more 
sophisticated indexing methods can reduce this to 
logarithmic expected time (Friedman, Bentley, & 
Finkel, 1977). 5 



^C4.5 suffers a small constant overhead due to the 
C code having to read the test data in from a separate 
file. 

^It should be noted that the implementation of 
nearest neighbor was optimized to handle sparse bi- 
nary vectors by only including and comparing the fea- 



Discussion of Results 

Naive Bayes and perceptron are similar in that 
they both employ a weighted combination of all 
features. The decision-tree and logic-based ap- 
proaches all attempt to find a combination of a rel- 
atively small set of features that accurately predict 
classification. After training on 1,200 examples, 
the symbolic structures learned for the line corpus 
are relatively large. Average sizes are 369 leaves 
for C4.5 decision trees, 742 literals for PFoiL- 
DLiST decision lists, 841 literals for PFoiL-DNF 
formulae, and 1197 literals for PFoiL-CNF for- 
mulae. However, many nodes or literals can test 
the same feature and the last two results include 
the total literal count for six separate DNF or 
CNF formulae (one for each sense). Therefore, 
each discrimination is clearly only testing a rel- 
atively small fraction of the 2,859 available fea- 
tures. Nearest neighbor bases its classifications on 
all features; however, it weights them all equally. 
Therefore, differential weighting is apparently nec- 
essary for high-performance on this problem. Al- 
ternative instance-based methods that weight fea- 
tures based on their predictive ability have also 
been developed (Aha et al., 1991). Therefore, our 
results indicate that lexical disambiguation is per- 
haps best performed using methods that combine 
weighted evidence from all of the features rather 

tures actually present in the examples. Without this 
optimization, testing would have been several orders 
of magnitude slower. 
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Figure 2: Training Time for "Line" Corpus 



than making a decision by testing only a small 
subset of highly predictive features. 

Among the other methods tested, decision 
lists seem to perform the best. The ordering of 
rules employed in a decision list in order to sim- 
plify the representation and perform conflict reso- 
lution apparently gives it an advantage over other 
symbolic methods on this task. In addition to the 
results reported by Yarowsky (1994) and Mooney 
and Califf (1995), it provides evidence for the 
utility of this representation for natural-language 
problems. 

With respect to training time, the symbolic 
methods are significantly slower since they are 
searching for a simple declarative representation of 
the concept. Empirically, the time complexity for 
most methods are growing somewhat worse than 
linearly in the number of training examples. The 
worst in this regard are PFoiL-DNF and PFoiL- 
CNF which have a worst-case complexity of O(n^) 
(Mooney, 1995). However, all of the methods are 
able to process fairly large sets of data in reason- 
able time. 

With respect to testing time, the symbolic 
methods perform the best since they only need 
to test a small number of features before making 
a decision. Therefore, in an application where re- 
sponse time is critical, learned rules or decision 
trees could provide rapid classification with only 
a modest decrease in accuracy. Not surprisingly, 
there is a trade-off between training time and test- 



ing time, the symbolic methods spend more effort 
during training compressing the representation of 
the learned concept resulting in a simpler descrip- 
tion that is quicker to test. 



Future Research 



The current results are for only one simple en- 
coding of the lexical disambiguation problem into 
a feature vector representing an unordered set of 
word stems. This paper has focused on explor- 
ing the space of possible algorithms rather than 
the space of possible input representations. Al- 
ternative encodings which exploit positional infor- 
mation, syntactic word tags, syntactic parse trees, 
semantic information, etc. should be tested to de- 
termine the utility of more sophisticated represen- 
tations. In particular, it would be interesting to 
see if the accuracy ranking of the seven algorithms 
is affected by a change in the representation. 

Similar comparisons of a range of algorithms 
should also be performed on other natural lan- 
guage problems such as part-of-speech tagging 
(Church, 1988), prepositional phrase attachment 
(Hindle & Rooth, 1993), anaphora resolution 
(Anoe & Bennett, 1995), etc.. Since the require- 
ments of individual tasks vary, different algorithms 
may be suitable for different sub-problems in nat- 
ural language processing. 
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Figure 3: Testing Time for "Line" Corpus 



Conclusions 

This paper has presented fairly comprehensive ex- 
periments comparing seven quite different empiri- 
cal methods on learning to disambiguate words in 
context. Methods that employ a weighted com- 
bination of a large set of features, such as sim- 
ple Bayesian and neural-network methods, were 
shown to perform better than alternative meth- 
ods such as decision-tree, rule-based, and instance- 
based techniques on the problem of disambiguat- 
ing the word "line" into one of six possible senses 
given the words that appear in the current and 
previous sentence as context. Although different 
learning algorithms can frequently perform quite 
similarly, they all have specific biases in their rep- 
resentation of concepts and therefore can illustrate 
both strengths and weaknesses in particular appli- 
cations. Only rigorous experimental comparisons 
together with a qualitative analysis and explana- 
tion of their results can help determine the appro- 
priate methods for particular problems in natural 
language processing. 
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